(54) Document Production



(57) Structured-format documents are produced in a process
in which a file is received in a particular word processor
format (input A), or is received in any other format (input B)
and converted (2) to the particular word processor format. The
system loads a parameter activation table which sets document
parameter values to allow DTDs to be automatically
implemented. The document is cleaned (5) and tagged (6). The
tagging provides an important link to allow automatic
conversion at a later stage in the process. There is
copy-editing (7) followed by validation of the file
preparation stage. This involves automatic validation of tags,
including validation of their order and nesting arrangement.
Automatic conversion to SGML is performed in a sequence of
symbol/character conversion (20), tag conversion (21),
equation processing (22), and floating element processing
(23). Final validation (24) is then performed.
Description



Field of the Invention 

[0001] The invention relates to production of documents
in a structured format such as Standard Generalized
Markup Language (SGML) format.

Prior Art Discussion 

[0002] A structured format such as SGML allows out- 
put of a document in a wide variety of formats using 
available tools. Such a structured format is therefore of 
enormous benefit to the document production industry, 
such as for publication of academic journals. In the art, 
WO98/34179, US5557720, and US5140521 describe
techniques for processing structured-format docu- 
ments. In general, this prior art relates to either altering 
a structured-format document, or processing such doc- 
uments to generate a required output format for either 
display or printing. 

[0003] However, a major problem for production of 
documents in a structured format is that of reaching this 
format. If the document is authored in the structured for- 
mat, then specialised knowledge is required and the 
task is time-consuming. Alternatively, if the document is 
authored in a conventional word processor format and 
is subsequently converted, the conversion is very
time-consuming and is error-prone.

Objects of the Invention 

[0004] The invention is therefore directed toward pro- 
viding a process for producing a document in a struc- 
tured format in a more efficient manner. 
[0005] Another object is that errors in the document 
be consistently reduced. 

SUMMARY OF THE INVENTION 

[0006] According to the invention, there is provided a 
document production process carried out by a system 
comprising a processor having an editor interface and 
memory access means, the process comprising the 
steps of:- 

writing a document comprising characters in a word 
processor format to memory; 

writing document publication parameter values to 
memory;

automatically correcting the document according to 
typesetting rules; 

automatically tagging the document to delimit character
strings by inserting tags next to the associated
character strings, the tagging being performed according
to the publication parameter values; and



automatically converting the document to a struc- 
tured format by substituting tags with structured for- 
mat code to provide a structured document. 

[0007] The steps of automatically correcting accord- 
ing to typesetting rules, automatic tagging, and automat- 
ic conversion allow for a highly automated process for 
bringing a document from a standard word processor 
format to a structured format. This allows the document 
author to use a word processor which he or she is fa- 
miliar with, and divorces him or her from structured for- 
mat techniques. These steps also help to ensure that 
errors are minimised. 

[0008] In one embodiment, conversion is performed 
by automatic comparison of tags with reference tags 
stored in look-up tables. 

[0009] Preferably, the conversion step includes the 
sub-steps of recognising foreign objects in the docu- 
ment, exporting the foreign objects to a separate proc- 
ess, converting the foreign objects to a text format, and 
subsequently importing the text and processing the text 
to convert to the structured format. 
[0010] In one embodiment, the conversion step com- 
prises the sub-step of separately converting floating el- 
ements according to document parameter values and 
structure of the floating element. 
[0011] These automatic conversion steps in se- 
quence provide comprehensive conversion to a struc- 
tured format. 

[0012] Preferably, the process comprises the further 
step of parsing the structured format code for final vali- 
dation. This helps to ensure document quality. 
[0013] In one embodiment, the document parameter 
values are written as an array of flags to load an activa- 
tion table which activates and deactivates parameter 
options. This is a very effective way of recording param- 
eter values for a particular document. 
[0014] Preferably, the tagging step involves automatic
recognition of elements. 

[0015] In one embodiment, the process comprises the
further step of copy-editing the document after tagging 
by automatically converting words according to a break- 
down of the word characters. 

[0016] In another embodiment, the copy-editing step 
includes the sub-steps of building an array of document 
references by automatic recognition and subsequently 
sorting them according to an operator-inputted sort cri- 
terion. 

[0017] Preferably, the process comprises the further 
step of automatic pre-conversion validation, in which
tags are compared with reference tags and nesting is 
validated according to the document parameter values. 
[0018] In one embodiment, the pre-conversion validation
step includes the sub-step of automatically locating
any invalid symbols and generating corresponding error 
messages. 
[0019] In another embodiment, the pre-conversion
validation step includes the sub-step of automatically 
identifying references, building an array in memory, and 
searching to determine if any do not exist in the docu- 
ment. 

[0020] According to another aspect, the invention pro- 
vides a document production system comprising a proc- 
essor having an editor interface and memory access 
means, the processor comprising:-

means for writing a document comprising charac- 
ters in a word processor format to memory; 

means for writing document publication parameter 
values to memory;

means for automatically correcting the document 
according to typesetting rules; 

means for automatically tagging the document to 
delimit character strings by inserting tags next to the 
associated character strings, the tagging being per- 
formed according to the publication parameter val- 
ues; and 

automatically converting the document to a struc- 
tured format by substituting tags with structured for- 
mat code to provide a structured document. 

DETAILED DESCRIPTION OF THE INVENTION 

Brief Description of the Drawings 

[0021] The invention will be more clearly understood 
from the following description of some embodiments 
thereof, given by way of example only with reference to 
the accompanying drawings in which:- 

Figs. 1(a), 1(b) and 1(c) are together a flow chart
illustrating a production process of the invention; 
and 

Figs. 2 to 4 are samples of a document at various 
process stages. 

Description of the Embodiments 

[0022] The drawings show a process 1 for producing 
a document in a structured format, in this embodiment 
SGML. The process 1 is carried out by a conventional 
hardware system such as a PC or a client/server net- 
work architecture. The hardware is programmed for 
typesetting as described in the following description of 
the process 1.

[0023] The process takes an authored document in a
particular word processor format (input A), or in a different
word processor format or as a manually type-written document
(input B). If input B is received, the process in step 2
converts the document to the particular word processor format
by optical character recognition or word processor conversion
as applicable. Fig. 2 is a sample from a received document in
Word™ format. As is clear from Fig. 2, this is totally
conventional as the author can work with a conventional word
processor and needs no knowledge of structured formats.

[0024] In step 3 process identifiers are inputted by an
operator. These identifiers identify the particular document
being produced for publication, the client, and other
identification information.

[0025] In step 4, a parameter activation table is loaded.
This table includes flags which activate or deactivate
various document parameter values. The rules and the table
are structured to represent Document Type Definition (DTD)
information in the system so that DTD information may be
automatically processed. In most instances, the activation
table is loaded simply by the operator selecting a publisher
(client) and the system automatically loading the values.
This is performed in a matter of seconds and requires little
operator skill, even though the parameter values deal with
complex typesetting and publication technical issues.
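By way of illustration only, the loading of the activation table from an array of flags might be sketched as follows in Python. The option names, publisher names, and flag values are hypothetical; the description does not list the actual parameter options.

```python
# Hypothetical parameter options; the description does not name the real ones.
OPTIONS = ["numbered_references", "floats_at_end", "british_spelling"]

# Hypothetical per-publisher flag arrays (1 = activate, 0 = deactivate).
PUBLISHER_FLAGS = {
    "publisher_a": [1, 1, 0],
    "publisher_b": [0, 0, 1],
}


def load_activation_table(publisher):
    """Build an option -> bool table from the publisher's flag array."""
    flags = PUBLISHER_FLAGS[publisher]
    if len(flags) != len(OPTIONS):
        raise ValueError("flag array does not match the option list")
    return {option: bool(flag) for option, flag in zip(OPTIONS, flags)}


# Selecting a publisher is the only operator input in this sketch.
table = load_activation_table("publisher_a")
```

Later stages in such a sketch would consult `table` rather than hard-coded publisher rules, which is one way the operator's single selection could drive the rest of the process.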
[0026] In step 5 the document is automatically prepared
and cleaned. This involves the system processor applying
typesetting rules such as removing multiple spaces. In
addition, various rules are applied for consistency, such as
removing spaces from around mathematical symbols. Also,
spelling mistakes are corrected using a spell-checker
program. Tables and figures are moved to the end of the
document to facilitate later processing steps.
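Two of the cleaning rules of step 5 can be sketched in Python as below. This is a minimal example, assuming the rules operate on plain text; the set of mathematical symbols is a hypothetical choice, and spell-checking and the moving of tables and figures are not shown.

```python
import re

# Hypothetical set of mathematical symbols to close up; an assumption.
MATH_SYMBOLS = "=+<>"


def clean_text(text):
    """Apply two illustrative typesetting rules to a text string."""
    # Rule 1: collapse runs of spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    # Rule 2: remove spaces around the listed mathematical symbols.
    pattern = r" *([" + re.escape(MATH_SYMBOLS) + r"]) *"
    return re.sub(pattern, r"\1", text)
```

For instance, `clean_text("a  =  b   +  c")` yields `a=b+c`.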

[0027] In step 6 the document is tagged with internal
system tags. The system progresses through document sections
in sequence, in this embodiment the frontmatter, bodymatter,
backmatter, tables and figures and the cross references. The
tags are subsequently of benefit in automatically converting
the document to SGML. To perform tagging, the system
generates prompts for the operator and, based on the
operator's responses and internally stored rules and tables,
the system recognises elements of the sections and tags them
accordingly.
[0028] An important aspect of the system operation, both
to tag automatically in step 6 and to perform automatic
recognition in subsequent process steps, is the underlying
pattern matching method. The parameter values inputted in
step 4 set a sequence of sections which the system expects to
read. This sequence determines activation of programs in
sequence, each relating to a document section. Each such
program is dedicated to the associated section and accesses a
dedicated set of relatively small look-up tables. This
pattern-matching action allows very fast pattern recognition,
and so the system can typically make one pass through a
document, involving both recognition and insertion, in a
matter of several seconds. Also, this pattern matching
technique is modular as it allows editing of patterns on a
section-by-section basis.
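The section-by-section pattern matching described above might be sketched as follows. This is an illustration under stated assumptions: the line patterns, the `heading` tag, and the use of paired start and end tags for every element are hypothetical, and operator prompts are omitted.

```python
import re

# Hypothetical per-section look-up tables: (regex, tag name) pairs.
SECTION_TABLES = {
    "frontmatter": [
        (re.compile(r"^Subtitle: (?P<body>.+)$"), "sbt"),
        (re.compile(r"^Publisher: (?P<body>.+)$"), "pub.name"),
    ],
    "bodymatter": [
        (re.compile(r"^(?P<body>\d+\. .+)$"), "heading"),
    ],
}


def tag_section(section, lines):
    """Tag each line using only the small table dedicated to its section."""
    tagged = []
    for line in lines:
        for pattern, tag in SECTION_TABLES[section]:
            match = pattern.match(line)
            if match:
                line = "<{0}>{1}</{0}>".format(tag, match.group("body"))
                break
        tagged.append(line)
    return tagged


def tag_document(section_sequence, document):
    """Process sections in the order set by the parameter values."""
    return {name: tag_section(name, document[name]) for name in section_sequence}
```

Because each section consults only its own table, a pattern can be edited for one section without touching the others, which reflects the modularity noted in the text.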
[0029] An example of a tagged section is shown in
Fig. 3. It will be clear from Fig. 3 that tagging is quite
comprehensive. Each tag identifies an element of the 
document structure. The elements are typically sub-di- 
visions of document sections such as frontmatter. For 
example, <snm> tags surname, <frm> tags forename, 
<pub.name> tags the publisher name, <sbt> tags sub- 
title start, and </sbt> tags subtitle end. The solid rectan- 
gular symbol tags end of flat text of the document in cer- 
tain parts such as the author section. It will be appreci- 
ated that tagging is achieved very quickly despite the 
complex nature of document information because of use 
of code and look-up tables dedicated to sections of the 
documents. These are in turn set by the loaded param- 
eter activation table. 

[0030] Another aspect of the tagged document is that 
the text and the tags themselves are each displayed by 
the system with a highlighting colour which indicates the 
nature of what is being displayed. The colours are not 
recognised by the system for automatic processing, but 
instead are generated by the system to allow operator 
interaction in a comprehensive and quick manner. The 
colours allow the operator to immediately visually delimit 
the tagged elements and to quickly intervene if errors 
arise. To allow such interaction, the system pauses 
processing for limited periods of several seconds at spe- 
cific intervals. 

[0031] In step 7 the system performs copy-editing. This
involves spell-checking and grammar-checking the document.
The processor operates according to a find/replace program
which automatically breaks down character strings to validate
the internal fonts used. For example, the author may mean one
particular form of the expression x23 but may have typed
variant forms of it at different places in the text. The
system converts all instances of x23 into their correct form.
As part of the editing step 7, the system converts styles in
the document into their correct form as required by the
document parameter values. A particular example is
bibliographic reference style. Some publishers require these
references to be name/date references, while others have
these references numbered. For example, if the first
reference in a document is a reference to an article
published by "Smith and Jones" in 1998, then in the name/date
format the text for this in the bibliography group would be
ordered alphabetically and so would be about half way down
the list. On the other hand, in a numbered reference format,
the text would be at the start of the bibliography list as it
is the first one cited. The system prompts the operator to
select between these styles and then automatically implements
them by generating a list of all of the references and
sorting them accordingly. Finally, the editing step 7
involves pulling all floating elements to the end of the
document to facilitate faster handling at a later stage in
the process.
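The two bibliography orderings discussed above can be sketched as follows. This is an illustrative example only; the data layout (a dictionary of keys to author names and year) and the style labels `name_date` and `numbered` are assumptions, not taken from the description.

```python
def order_references(references, citation_order, style):
    """Sort bibliography keys for the operator-selected reference style.

    references: dict mapping a key to (author_names, year).
    citation_order: keys in the order they are cited in the text.
    style: "name_date" or "numbered" (hypothetical labels).
    """
    if style == "name_date":
        # Alphabetical by author names, then by year.
        return sorted(
            references,
            key=lambda k: (references[k][0].lower(), references[k][1]),
        )
    if style == "numbered":
        # Order of first citation; repeat citations are ignored.
        ordered = []
        for key in citation_order:
            if key in references and key not in ordered:
                ordered.append(key)
        return ordered
    raise ValueError("unknown style: " + style)
```

With "Smith and Jones" (1998) cited first among three references, the `numbered` style places it first, while the `name_date` style places it after any author sorting earlier alphabetically.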

[0032] This work completes a preparatory stage of the
process and this stage is then verified as illustrated in
Fig. 1(b) in steps 8 to 15. In step 8 the tags are
automatically compared with an internally-stored set of
reference tags. This comparison is performed according to the
received document parameter values. The order and nesting of
the tags are checked in steps 9 and 10, again according to
the document parameter values. In step 11 symbols within the
document are checked to locate any unknown ones. This is
performed by automated searching for characters which are not
in the ranges 1-9, a-z, or A-Z and do not match a list of
valid characters held by the system. Any unknown characters
found in the document are reported for correction.
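The tag comparison, nesting check, and unknown-symbol search of steps 8, 10, and 11 might be sketched as below. The reference tag set and the list of valid characters are hypothetical, the order check of step 9 is not shown, and the character ranges follow the text as written (1-9, a-z, A-Z).

```python
import re

# Hypothetical reference tags and valid extra characters.
REFERENCE_TAGS = {"snm", "frm", "sbt", "pub.name"}
VALID_CHARACTERS = set(" .,;:()-<>/")

TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.]*)>")


def validate_tags(text):
    """Return error messages for unknown or badly nested tags."""
    errors, stack = [], []
    for match in TAG_PATTERN.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if name not in REFERENCE_TAGS:
            errors.append("unknown tag: " + name)
        elif not closing:
            stack.append(name)  # remember the open tag
        elif not stack or stack.pop() != name:
            errors.append("bad nesting at: </" + name + ">")
    errors.extend("unclosed tag: <" + name + ">" for name in stack)
    return errors


def find_unknown_symbols(text):
    """Return characters outside 1-9, a-z, A-Z and the valid list."""
    return sorted({
        ch for ch in text
        if not re.match(r"[1-9a-zA-Z]", ch) and ch not in VALID_CHARACTERS
    })
```

The returned messages correspond to the error reporting of step 14; correction (step 15) would remain an operator or system action outside this sketch.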
[0033] In step 12 cross-references are checked for
validity. Cross-references include bibliographic references
and references to tables, figures, and footnotes. This
involves the system building a list in memory of the items
referred to. The system then checks each reference in the
body of the document. The system reports on references that
cite any non-existent items and on items that should be
referred to but are not. As for steps 8 to 11, errors found
are reported. However, in addition to step 12 there is a
further step 13 in which a list of unlinked cross-references
is generated to prompt feedback by the operator. Generation
of error messages is indicated by step 14, and correction by
step 15. The correction may involve interactive input by the
operator.
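The cross-reference check of step 12 amounts to comparing two sets, as in the following sketch. The in-text citation syntax `[ref:KEY]` is a hypothetical stand-in for whatever internal cross-reference tags the system uses.

```python
import re

# Hypothetical in-text citation syntax: [ref:KEY]; an assumption.
CITATION = re.compile(r"\[ref:([A-Za-z0-9_]+)\]")


def check_cross_references(body_text, defined_items):
    """Return (cited but not defined, defined but never cited)."""
    cited = set(CITATION.findall(body_text))
    defined = set(defined_items)
    return sorted(cited - defined), sorted(defined - cited)
```

The first list corresponds to references that cite non-existent items, and the second to items that should be referred to but are not.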

[0034] Referring now to Fig. 1(c), the final phase of the
process is illustrated. In step 20 every symbol and character
not in the 1-9, a-z, and A-Z ranges is checked against a list
to locate the SGML code for that character. The SGML code is
substituted in the text automatically. In step 21, tags which
were inserted in the preparation stage of the process are
converted to their SGML equivalents. Again, this is automated
because the tags are simply checked against a list in a
look-up table and substituted. In step 22 equations and
foreign objects in the document are converted to their
correct SGML tags. This involves the system transmitting
commands to convert the object into a format which can be
understood by an application. For example, for a mathematical
equation, a command is sent to a "MathType™" application to
convert the equation into a text equivalent of the object's
code. The system then converts this into SGML by searching
the (now text) object and processing sub-objects. Floating
elements are converted to SGML and are embodied in the SGML
document at the correct position in step 23. For example,
some document parameter values may require the "floats" to be
at the end of the body of the document, while others require
each float to be located immediately after the first
reference to it. The floats are converted based on rules held
in memory. These rules are taken from both the document
parameter values and the float structure so that, for
example, tables will always have cells and rows and this
structure is used in the process.
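Steps 20 and 21 are both look-up-and-substitute operations and might be sketched as follows. The table contents are illustrative assumptions: the entity names are taken from the ISO character entity sets commonly used with SGML, and the SGML element names are invented for the example. Equation and float processing (steps 22 and 23) are not shown.

```python
import re

# Hypothetical look-up tables; the element names are invented.
SYMBOL_TABLE = {"\u00e9": "&eacute;", "\u03b1": "&alpha;"}
TAG_TABLE = {"snm": "SURNAME", "frm": "FORENAME"}


def convert_symbols(text):
    """Step 20: replace listed characters with their SGML codes."""
    return "".join(SYMBOL_TABLE.get(ch, ch) for ch in text)


def convert_tags(text):
    """Step 21: substitute internal tags with their SGML equivalents."""
    def substitute(match):
        slash, name = match.group(1), match.group(2)
        # Tags absent from the table are left unchanged in this sketch.
        return "<" + slash + TAG_TABLE.get(name, name) + ">"
    return re.sub(r"<(/?)([A-Za-z][A-Za-z0-9.]*)>", substitute, text)
```

Applied in sequence, `convert_tags(convert_symbols("<snm>Andr\u00e9</snm>"))` gives `<SURNAME>Andr&eacute;</SURNAME>`.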

[0035] A sample of an SGML file is shown in Fig. 4. It 
will be clear from Fig. 4 that a structured document is 
quite complex and requires specialist knowledge. If the 
structured document were to be produced manually, it
would be a very time-consuming exercise and would al- 
so be error-prone. 

[0036] In step 24 the SGML file is passed through a 
parser to ensure that the SGML is perfectly correct. This
parser is a tool which exhaustively checks and validates 
the file against the complete document parameter val- 
ues. This ensures that the correct set of document pa- 
rameter values are used as are the various rules held 
by the system. This acts as a system check and reports 
any errors. 

[0037] An intermediate-output SGML file is provided 
in step 25 and this is used as the basis for the final out- 
put. For example, there may be DTD-specific conver- 
sion in step 27 to provide a final output SGML file in step 
28. Alternatively, there may be journal-specific conver- 
sion in step 29 with typeset code editing in step 30 and 
a PostScript output generated in step 31. Thus, the output
SGML file may be converted into the typeset code
required to correct style and display the document for a 
typesetting system. Because the document provided in 
step 26 is in SGML format, many alternatives are pos- 
sible. 

[0038] It will be appreciated that the invention pro- 
vides a process which generates a structured document 
in a highly-controlled manner whereby little operator skill 
is required. This in effect bridges the gap between au- 
thoring a document and having it ready for publication. 
The author can work in his or her preferred manner with- 
out the need to have any knowledge of the publication 
process and structured formats - the process taking the 
author's output and generating the structured docu- 
ment. Another important advantage of the invention is 
the fact that the output structured document is of excel- 
lent quality because of the automatic validation steps of 
the process. 

[0039] In summary, the process allows a typesetter/publisher
to take an authored document in any format and generate a
structured document for publication very quickly, without the
need for highly skilled operators, and with excellent quality.

[0040] The invention is not limited to the embodiments 
described, but may be varied in construction and detail 
within the scope of the claims. 

Claims 

1. A document production process carried out by a
system comprising a processor having an editor interface
and memory access means, the process comprising the
steps of:-

writing (2, 3) a document comprising characters 
in a word processor format to memory; 

writing document publication parameter values 
(4) to memory; 



automatically correcting (5) the document ac- 
cording to typesetting rules; 

automatically tagging (6) the document to delimit
character strings by inserting tags next to the associated
character strings, the tagging being performed according
to the publication parameter values; and

automatically converting (20-24) the document to a
structured format by substituting tags with structured
format code to provide a structured document.

2. A process as claimed in claim 1, wherein conver- 
sion is performed by automatic comparison of tags 
with reference tags stored in look-up tables. 

3. A process as claimed in claim 2, wherein the conversion
step includes the sub-steps (22-23) of recognising foreign
objects in the document, passing the foreign objects to a
separate process, converting the foreign objects to a text
format, and subsequently processing the text to convert to
the structured format.

4. A process as claimed in claim 1, wherein the conversion
step comprises the sub-step of separately converting floating
elements (23) according to document parameter values and
structure of the floating element.

5. A process as claimed in claim 1, wherein the process
comprises the further step of parsing the structured format
code for final validation (24).

6. A process as claimed in claim 1, wherein the document
parameter values are written (4) as an array of flags to load
an activation table which activates and deactivates parameter
options.

7. A process as claimed in claim 1, wherein the tagging
step involves automatic recognition of elements.

8. A process as claimed in claim 1, wherein the process
comprises the further step of copy-editing (7) the document
after tagging by automatically converting words according to
a break-down of the word characters.

9. A process as claimed in claim 8, wherein the copy-editing
step (7) includes the sub-steps of building an array of
document references by automatic recognition and subsequently
sorting them according to an operator-inputted sort criterion.

10. A process as claimed in claim 1, comprising the further
step of automatic pre-conversion validation (8-13), in which
tags are compared with reference tags and nesting is
validated according to the document parameter values.

11. A process as claimed in claim 10, wherein the
pre-conversion validation step includes the sub-step of
automatically locating any invalid symbols and generating
corresponding error messages.

12. A process as claimed in claim 11, wherein the
pre-conversion validation step includes the sub-steps of
automatically identifying references, building an array in
memory, and searching to determine if any do not exist in the
document.

13. Documents whenever produced by a process as 
claimed in any preceding claim. 

14. A document production system comprising a processor
having an editor interface and memory access means, the
processor comprising:-

means for writing a document comprising characters in a
word processor format to memory;

means for writing document publication parameter values
to memory;

means for automatically correcting the document according
to typesetting rules;

means for automatically tagging the document to delimit
character strings by inserting tags next to the associated
character strings, the tagging being performed according to
the publication parameter values; and

automatically converting the document to a structured
format by substituting tags with structured format code to
provide a structured document.